Word range

Definition

Word range is the number of texts in a corpus in which a particular word occurs. Word frequency is closely linked to language proficiency, but a frequency count can be inflated by technical terms that occur many times in only a few documents. Word range mitigates this inflation and can more accurately represent an individual's likely exposure to specific words.
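
The difference between frequency and range can be illustrated with a minimal sketch. The three-text toy corpus and the function names below are hypothetical and serve only to show how a technical term can have a high frequency but a low range.

```python
from collections import Counter

# Toy corpus (hypothetical): three short texts, already tokenized.
corpus = [
    ["the", "enzyme", "binds", "the", "enzyme", "substrate", "enzyme"],
    ["the", "river", "was", "cold"],
    ["she", "crossed", "the", "river"],
]

def frequency(texts):
    """Total number of occurrences of each word across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts

def word_range(texts):
    """Number of texts in which each word occurs at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(text))  # each word counts once per text
    return counts

freq = frequency(corpus)
rng = word_range(corpus)

print(freq["enzyme"], rng["enzyme"])  # 3 1
print(freq["river"], rng["river"])    # 2 2
```

Here "enzyme" is more frequent than "river" but occurs in fewer texts, so its range is lower.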

Methodology

Word range is calculated as the number of documents or categories in which a word appears. Range values are often aggregated as a mean score across the words in a text. Several corpora are used to derive these indices. Variant indices apply a logarithmic transformation to the range values or calculate range over lemmas rather than individual word forms.
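
As a rough sketch of the aggregation step, the snippet below averages range values over the words of a text, with an optional log transformation. The norms dictionary is made up, the base-10 logarithm is an assumption (the base is not stated here), and skipping words that have no norm is also an assumption about how missing values are handled.

```python
import math

# Hypothetical range norms: word -> number of corpus documents containing it.
range_norms = {"the": 1000, "river": 100, "enzyme": 10}

def mean_range(tokens, norms, log=False):
    """Mean range value over the tokens that have a norm.

    Tokens without a norm are skipped (assumption).
    Returns None when no token in the text has a norm.
    """
    values = [norms[t] for t in tokens if t in norms]
    if log:
        # Base 10 is assumed; each value is transformed before averaging.
        values = [math.log10(v) for v in values]
    if not values:
        return None
    return sum(values) / len(values)

text = ["the", "river", "the", "enzyme", "zyzzyva"]
raw_score = mean_range(text, range_norms)
log_score = mean_range(text, range_norms, log=True)
print(raw_score)  # (1000 + 100 + 1000 + 10) / 4
print(log_score)  # (3 + 2 + 3 + 1) / 4
```

The log transformation compresses the gap between very widely used words and the rest, so a handful of near-universal words has less influence on the mean.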

Corpora used

  • BNC
  • KF
  • COCA
  • SUBTLEXus
  • TOEFL11 Corpus

Register

Varies by corpus, including written/spoken registers, academic/non-academic texts, and proficiency levels.

Calculated indices

  • In the index names below, replace [ ] with a register (e.g., academic, fiction, spoken, written)

BNC

The British National Corpus (BNC) is a 100-million-word collection of samples from a wide range of written and spoken British English from the late 20th century.

Raw range

  • Indices:
    • BNC_[ ]_Range_AW
    • BNC_[ ]_Range_CW
    • BNC_[ ]_Range_FW
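
The AW, CW, and FW suffixes are read here as all words, content words, and function words. That reading is an assumption, since the abbreviations are not expanded on this page. Under that assumption, a minimal sketch of scoring one text three ways might look as follows; the function-word list and the range norms are tiny hypothetical stand-ins.

```python
# Tiny hypothetical function-word list; a real list would be much longer.
FUNCTION_WORDS = {"the", "a", "of", "in", "and", "was"}

def split_word_classes(tokens):
    """Partition tokens into all words (AW), content words (CW), function words (FW)."""
    fw = [t for t in tokens if t in FUNCTION_WORDS]
    cw = [t for t in tokens if t not in FUNCTION_WORDS]
    return {"AW": list(tokens), "CW": cw, "FW": fw}

def mean_range_by_class(tokens, norms):
    """Mean range per word class; None when a class has no scored tokens."""
    scores = {}
    for label, subset in split_word_classes(tokens).items():
        values = [norms[t] for t in subset if t in norms]
        scores[label] = sum(values) / len(values) if values else None
    return scores

# Hypothetical range norms (number of documents containing each word).
norms = {"the": 1000, "of": 900, "river": 100, "enzyme": 10}
scores = mean_range_by_class(["the", "river", "of", "enzyme"], norms)
print(scores)  # {'AW': 502.5, 'CW': 55.0, 'FW': 950.0}
```

Function words tend to occur in nearly every document, so separating them keeps their very high range values from masking differences among content words.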

KF

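The Kucera-Francis (KF) Corpus is based on the Brown Corpus and provides frequency norms from American English texts published around 1961. Ncats indices count the text categories in which a word occurs, and Nsamp indices count the text samples in which it occurs.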

  • Indices:
    • KF_Ncats_AW
    • KF_Ncats_CW
    • KF_Ncats_FW
    • KF_Nsamp_AW
    • KF_Nsamp_CW
    • KF_Nsamp_FW
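
Assuming Ncats counts the categories and Nsamp counts the samples in which a word occurs, the two counts can be sketched as below. The category labels and sample texts are invented for illustration and are not drawn from the actual corpus.

```python
# Hypothetical samples: (category label, tokenized text).
samples = [
    ("press", ["the", "senate", "voted"]),
    ("press", ["the", "mayor", "voted"]),
    ("fiction", ["the", "night", "was", "long"]),
]

def ncats_nsamp(word, labelled_samples):
    """Return (number of categories, number of samples) containing the word."""
    hits = [category for category, tokens in labelled_samples if word in tokens]
    return len(set(hits)), len(hits)

print(ncats_nsamp("voted", samples))  # (1, 2)
print(ncats_nsamp("the", samples))    # (2, 3)
```

A word such as "voted" can appear in several samples while staying inside a single category, which is why the two counts are reported separately.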

COCA

The Corpus of Contemporary American English (COCA) includes more than one billion words from spoken, fiction, magazine, newspaper, and academic texts, offering frequency and range data for a variety of registers.

Raw range

  • Indices:
    • COCA_[ ]_Range_AW
    • COCA_[ ]_Range_CW
    • COCA_[ ]_Range_FW

Logarithmic range

  • Indices:
    • COCA_[ ]_Range_Log_AW
    • COCA_[ ]_Range_Log_CW
    • COCA_[ ]_Range_Log_FW

Lemma range

  • Indices:
    • COCA_[ ]_lemma_range_AW
    • COCA_[ ]_lemma_range_CW
    • COCA_[ ]_lemma_range_FW

Logarithmic lemma range

  • Indices:
    • COCA_[ ]_lemma_range_Log_AW
    • COCA_[ ]_lemma_range_Log_CW
    • COCA_[ ]_lemma_range_Log_FW

Lemma type range

  • Indices:
    • COCA_[ ]_lemma_range_AW_TP
    • COCA_[ ]_lemma_range_CW_TP
    • COCA_[ ]_lemma_range_FW_TP

Logarithmic lemma type range

  • Indices:
    • COCA_[ ]_lemma_range_Log_AW_TP
    • COCA_[ ]_lemma_range_Log_CW_TP
    • COCA_[ ]_lemma_range_Log_FW_TP
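
The lemma and lemma type variants can be sketched as follows. This assumes that lemma range is looked up after mapping each word form to its lemma, and that the type variant counts each distinct lemma in the text once rather than once per occurrence. Both readings are assumptions based on the index names, and the lemma map and norms are hypothetical.

```python
# Hypothetical lemma map (word form -> lemma) and lemma range norms.
lemma_of = {"rivers": "river", "river": "river",
            "ran": "run", "runs": "run", "run": "run"}
lemma_range = {"river": 100, "run": 400}

def lemma_range_scores(tokens):
    """Return (token-based mean, type-based mean) of lemma range values."""
    lemmas = [lemma_of.get(t, t) for t in tokens]
    scored = [lem for lem in lemmas if lem in lemma_range]
    if not scored:
        return None, None
    # Token-based: every occurrence contributes to the mean.
    token_mean = sum(lemma_range[lem] for lem in scored) / len(scored)
    # Type-based: each distinct lemma contributes once.
    types = set(scored)
    type_mean = sum(lemma_range[lem] for lem in types) / len(types)
    return token_mean, type_mean

token_mean, type_mean = lemma_range_scores(["rivers", "ran", "runs", "run"])
print(token_mean)  # (100 + 400 + 400 + 400) / 4 = 325.0
print(type_mean)   # (100 + 400) / 2 = 250.0
```

The type-based mean is less sensitive to a writer repeating the same word many times.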

SUBTLEXus

SUBTLEXus is a subtitle-based corpus that captures word usage in spoken American English, which makes it well suited to estimating how widely words are used in conversation.

Raw range

  • Indices:
    • SUBTLEXus_Range_AW
    • SUBTLEXus_Range_CW
    • SUBTLEXus_Range_FW

Logarithmic range

  • Indices:
    • SUBTLEXus_Range_AW_Log
    • SUBTLEXus_Range_CW_Log
    • SUBTLEXus_Range_FW_Log

TOEFL11

The TOEFL11 Corpus is a learner corpus containing essays written by English language learners, categorized by proficiency level and L1 background.

Raw range

  • Indices:
    • NNS_Raw_Range_[High/Med/Low/WC]_AW
    • NNS_Raw_Range_[High/Med/Low/WC]_CW
    • NNS_Raw_Range_[High/Med/Low/WC]_FW

Logarithmic range

  • Indices:
    • NNS_Raw_Range_[High/Med/Low/WC]_AW_log
    • NNS_Raw_Range_[High/Med/Low/WC]_CW_log
    • NNS_Raw_Range_[High/Med/Low/WC]_FW_log

Lemma range

  • Indices:
    • NNS_Lemma_Range_[High/Med/Low/WC]_AW
    • NNS_Lemma_Range_[High/Med/Low/WC]_CW
    • NNS_Lemma_Range_[High/Med/Low/WC]_FW

Logarithmic lemma range

  • Indices:
    • NNS_Lemma_Range_[High/Med/Low/WC]_AW_log
    • NNS_Lemma_Range_[High/Med/Low/WC]_CW_log
    • NNS_Lemma_Range_[High/Med/Low/WC]_FW_log

Lemma type range

  • Indices:
    • NNS_Lemma_Range_Types_[High/Med/Low/WC]_AW
    • NNS_Lemma_Range_Types_[High/Med/Low/WC]_CW
    • NNS_Lemma_Range_Types_[High/Med/Low/WC]_FW

Logarithmic lemma type range

  • Indices:
    • NNS_Lemma_Range_Types_[High/Med/Low/WC]_AW_log
    • NNS_Lemma_Range_Types_[High/Med/Low/WC]_CW_log
    • NNS_Lemma_Range_Types_[High/Med/Low/WC]_FW_log
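
The bracketed alternatives in the index names above can be expanded mechanically. The helper below is a hypothetical convenience, not part of any tool; it assumes that each final index name is formed by substituting one of the slash-separated options for the bracketed part.

```python
import re

def expand_template(template):
    """Expand the first [A/B/C] group in a template into one name per option."""
    match = re.search(r"\[([^\]]+)\]", template)
    if match is None:
        return [template]  # nothing to expand
    options = match.group(1).split("/")
    prefix = template[:match.start()]
    suffix = template[match.end():]
    return [prefix + option + suffix for option in options]

names = expand_template("NNS_Raw_Range_[High/Med/Low/WC]_AW")
for name in names:
    print(name)
# NNS_Raw_Range_High_AW
# NNS_Raw_Range_Med_AW
# NNS_Raw_Range_Low_AW
# NNS_Raw_Range_WC_AW
```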

References

  • Eguchi, M., & Kyle, K. (2020). Continuing to explore the multidimensional nature of lexical sophistication: The case of oral proficiency interviews. The Modern Language Journal, 104(2), 381-400.
  • Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757-786.